A Kernel Independence Test for Random Processes
A new nonparametric approach to the problem of testing the independence of
two random processes is developed. The test statistic is the Hilbert-Schmidt
Independence Criterion (HSIC), which was used previously in testing
independence for i.i.d. pairs of variables. The asymptotic behaviour of HSIC is
established when computed from samples drawn from random processes. It is shown
that earlier bootstrap procedures which worked in the i.i.d. case will fail for
random processes, and an alternative consistent estimate of the p-values is
proposed. Tests on artificial data and real-world Forex data indicate that the
new test procedure discovers dependence which is missed by linear approaches,
while the earlier bootstrap procedure returns an elevated number of false
positives. The code is available online:
https://github.com/kacperChwialkowski/HSIC
Comment: In Proceedings of The 31st International Conference on Machine Learning
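To make the statistic concrete, a minimal sketch of the biased empirical HSIC with Gaussian kernels follows. The kernel bandwidth `sigma` is an illustrative choice, and the sketch deliberately omits the paper's actual contributions (the asymptotic analysis for random processes and the corrected p-value estimate); it only shows the quantity being tested.

```python
import numpy as np

def gaussian_kernel(x, sigma=1.0):
    # Pairwise squared Euclidean distances, then the Gaussian kernel matrix.
    sq = np.sum(x**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * x @ x.T
    return np.exp(-d2 / (2.0 * sigma**2))

def hsic(x, y, sigma=1.0):
    # Biased empirical HSIC: trace(K H L H) / n^2, where H centers the
    # kernel matrices. Large values suggest dependence between x and y.
    n = x.shape[0]
    K = gaussian_kernel(x, sigma)
    L = gaussian_kernel(y, sigma)
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(K @ H @ L @ H) / n**2
```

For i.i.d. data the null distribution of this statistic can be approximated by permuting one sample; the abstract's point is that such shuffling is invalid for dependent processes, where a different p-value estimate is needed.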
Interpretable Distribution Features with Maximum Testing Power
Two semimetrics on probability distributions are proposed, given as the sum
of differences of expectations of analytic functions evaluated at spatial or
frequency locations (i.e., features). The features are chosen so as to maximize
the distinguishability of the distributions, by optimizing a lower bound on
test power for a statistical test using these features. The result is a
parsimonious and interpretable indication of how and where two distributions
differ locally. An empirical estimate of the test power criterion converges
with increasing sample size, ensuring the quality of the returned features. In
real-world benchmarks on high-dimensional text and image data, linear-time
tests using the proposed semimetrics achieve comparable performance to the
state-of-the-art quadratic-time maximum mean discrepancy test, while returning
human-interpretable features that explain the test results.
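A simplified sketch of the spatial-feature statistic is below: the mean of a Gaussian kernel evaluated at a few test locations is computed for each sample, and the squared difference is summed. This is an unnormalized illustration only; the actual test statistic in the paper whitens the feature differences by their covariance, and the locations are optimized rather than fixed, both of which this sketch omits.

```python
import numpy as np

def me_statistic(x, y, locations, sigma=1.0):
    # Mean-embedding features: average Gaussian kernel value at each test
    # location, computed per sample. The squared difference of the two
    # feature vectors indicates where the distributions differ locally.
    def feat(data):
        d2 = np.sum((data[:, None, :] - locations[None, :, :])**2, axis=2)
        return np.exp(-d2 / (2.0 * sigma**2)).mean(axis=0)
    diff = feat(x) - feat(y)
    return np.sum(diff**2)
```

Locations where the per-location difference is large are exactly the interpretable features the abstract refers to: regions of the input space where the two distributions disagree.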
Fast Two-Sample Testing with Analytic Representations of Probability Measures
We propose a class of nonparametric two-sample tests with a cost linear in
the sample size. Two tests are given, both based on an ensemble of distances
between analytic functions representing each of the distributions. The first
test uses smoothed empirical characteristic functions to represent the
distributions, the second uses distribution embeddings in a reproducing kernel
Hilbert space. Analyticity implies that differences in the distributions may be
detected almost surely at a finite number of randomly chosen
locations/frequencies. The new tests are consistent against a larger class of
alternatives than the previous linear-time tests based on the (non-smoothed)
empirical characteristic functions, while being much faster than the current
state-of-the-art quadratic-time kernel-based or energy distance-based tests.
Experiments on artificial benchmarks and on challenging real-world testing
problems demonstrate that our tests give a better power/time tradeoff than
competing approaches, and in some cases, better outright power than even the
most expensive quadratic-time tests. This performance advantage is retained
even in high dimensions, and in cases where the difference in distributions is
not observable with low-order statistics.
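The first of the two tests can be sketched as follows: each sample is summarized by its empirical characteristic function at a handful of random frequencies, smoothed by a Gaussian window, and the test statistic is the squared distance between the two summaries. As with the previous sketch, this omits the covariance normalization used in the actual test, and the Gaussian window is one illustrative smoothing choice.

```python
import numpy as np

def scf_features(data, freqs):
    # Smoothed empirical characteristic function at the given frequencies:
    # each sample is down-weighted by a Gaussian window before averaging
    # the cosine and sine projections (real and imaginary parts of the CF).
    w = np.exp(-0.5 * np.sum(data**2, axis=1))
    proj = data @ freqs.T                      # shape (n, num_freqs)
    return np.concatenate([(w[:, None] * np.cos(proj)).mean(axis=0),
                           (w[:, None] * np.sin(proj)).mean(axis=0)])

def scf_statistic(x, y, freqs):
    # Unnormalized linear-time statistic: scaled squared distance between
    # the smoothed CF features of the two samples.
    d = scf_features(x, freqs) - scf_features(y, freqs)
    return x.shape[0] * np.sum(d**2)
```

Because each feature is a simple average over the sample, the cost is linear in the sample size; the "almost surely" guarantee in the abstract is what licenses using only a finite number of random frequencies.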